\begin{savequote}[8cm]
Science is what we understand well enough to explain to a computer. Art is
everything else we do.
\qauthor{Donald Knuth}
\end{savequote}

\chapter{Measurement and structure}
\label{ch:structure}
%\todo{Importance of statistical analysis and random models to check for
%  meaningful statistical deviations}

As discussed in the previous chapter, location-aware capabilities are gradually
being offered as a feature by many online services.
People appear more willing to share
information about their geographic position with friends, while companies can
customise their services by taking into account where the user is located.
%Thanks to this growing popularity of location-based features offered by online
%services, in particular by social networking platforms, users are now
%increasingly familiar with the act of sharing information about the places they
%visit. 
Therefore, online platforms are able to gather data not only on
social interactions, but also on users' physical movements. Such information is
available for the first time at unprecedented scales, posing new
questions about what spatial properties of user behaviour it might reveal.

In this chapter our aim is to exploit this simultaneous availability of social
and spatial data given by online services  to study the spatial properties
exhibited by the social connections arising among users. By embedding the 
social graphs in physical space,  we will focus on understanding whether, and how,  spatial and social
characteristics are related to each other.   Our findings suggest that
geographic space influences users in a heterogeneous way; hence,  we will
present new social network measures that simultaneously capture social and spatial
aspects. These measures allow users to be distinguished quantitatively according
to their geographic and social characteristics, enabling 
comparative analysis of different spatial social graphs.

\paragraph{Chapter outline}
%activity on a popular location-based social service, Gowalla, focusing on
%understanding and quantifying the statistical properties of user behaviour.  In
%particular, the check-in dynamics gives rise to a new dimension of user activity
%which was not present in previous online social tools. Since digital records of the
%interplay between users and locations are for the first time available for
%analysis, we focus on a primary and fundamental issue:  measure how
%users take advantage of social and location-based features, looking for
%differences and similarities between how users engage with place check-ins and
%with online friends.

In Section~\ref{sec:methodology} we present and describe our data collection
methodology, which has allowed us to acquire extensive traces about the geographic
and social properties of three popular online location-based social services. These digital records
about the interplay between online friendship ties and user locations make it
possible  to pursue a set of further investigations.

In Section~\ref{sec:socio_spatial} we take advantage of
user check-ins to assign a geographic position to each individual, embedding
social connections in space.  This allows us to study the spatial properties of
online friendship ties, with emphasis on the effect that 
geographic distance might have on online relationships. Our analysis focuses on whether
space homogeneously affects users or if, instead, different individuals tend to
have online ties spanning different geographic distances, leading to a more heterogeneous system. By
adopting randomised null models we are able to
investigate the statistical significance of the empirical properties found in
these networks.  Furthermore, we discuss how social and spatial factors jointly
influence the structure of connections between users, resulting in heterogeneity
between users.

Since our analysis suggests that space influences the social connections
established by online users, we then aim to include the effect of
spatial distance in standard network measures.  Thus, in
Section~\ref{sec:metrics} we define two novel geosocial measures that combine
standard network properties with spatial distance.  We demonstrate the
effectiveness of these measures in
capturing the impact of geographic distance on online social ties with a case
study; we compare location-based social
services to other online platforms where location-sharing features are not dominant,
discussing observed similarities and differences. Using our measures we
discover that spatial constraints do not uniformly affect different categories
of online social services; users of location-based services exhibit a stronger preference
for short-range social connections than users of content-sharing platforms. We
discuss the implications of our findings in Section~\ref{sec:disc3}, while
Section~\ref{sec:related3} reviews related results and
Section~\ref{sec:concl3} concludes the chapter.

\section{Data collection methodology}
\label{sec:methodology}
%The emergence of location-based social services has opened access to a valuable
%source of data about the geographic location of users, as well as to online
%friendship connections among them. This offers a perfect opportunity to gather
%data useful to study and understand the spatial properties of the social
%networks arising among online users.  
Our first goal has been to collect extensive traces  about the spatial and geographic
properties of online social services.  We have acquired traces
from three different online social platforms. Each service offers
location-sharing features, such as check-ins, or status updates where places are
tagged, as well as  social networking features. 

As we detail in this section, for each service we gathered data about both
social ties and user check-ins.  This allows us to extract the social network
arising among users and to assign a geographic \textit{home location} for each
user, effectively embedding the social graph in geographic space.  When users do
not explicitly specify their location,  we assign to each user the geographic
location of the place where he/she has made the greatest number of his/her
check-ins.  We discard users who have no check-ins.



\subsection{Brightkite}
%Brightkite was founded in 2007 as a social networking website which allowed
%users to post status updates tagged with real-world places and establish
%mutual friendship links. We study a dataset
%collected in September 2009 which includes friend lists and place-tagged
%messages for the entire Brightkite user base at
%that time.  
Brightkite was founded in 2007 as a social networking website that allowed users
to share their location, to post notes and to upload photos through different
interfaces.  It was initially based on the idea of performing check-ins at places,
  where users can see who is nearby and who has been there before.  Brightkite
  users could establish bidirectional friendship links and send public and
  private messages to each other.   Brightkite has recently transformed its
  service, offering a group-messaging mobile application.

Our measurement took place when the service was still mainly based on
mobile location-sharing. Brightkite  offered a public API which provided
geographic coordinates of user home locations and lists of friends.  We fetched
data from their API in September 2009; we seeded a crawler with 1,000 users
randomly selected from the public timeline offered by Brightkite  and then we
exhaustively retrieved friends and followed social links until we collected all
users connected to the seed users.  The resulting dataset contains information
about 54,190 users; it represents a complete snapshot of a location-based social
platform in its initial evolution phase, including the entirety of its social
graph.

\subsection{Foursquare}
Foursquare is a location-based social networking service launched in 2009 which
engages its users in a competition.  Users check in at venues in order to
be awarded points that contribute to their position on a leaderboard. In addition, users
can publish and share tips and suggestions related to the places they visit.  
The service also enables the creation of bidirectional friendship links.

Foursquare does not provide public access to user check-ins. However,
many Foursquare users choose to push their check-in
messages to Twitter automatically, which provides a public API to search and download
these messages. By using this API, we have recorded approximately 4 million tweets,
each one containing a check-in made  by a Foursquare user during June
2010. These messages come from about 250,000 different users and cover
about 1.5 million locations on the planet.  We estimate that our sample
contains approximately 20\% to 25\% of the entire Foursquare user base
at collection time, or between 40\% and 50\% of all active Foursquare
users~\footnote{Foursquare reached 1 million users in April
  2010~\cite{Sie10:foursquare}.}.

Since Foursquare does not provide unauthorised access to users' friend lists, we
have acquired friendship ties that Foursquare users have between them on Twitter,
where they are publicly available, to extract a social network. Our
assumption is that a friendship connection between two Foursquare users is
likely to be present also on Twitter, if the two users are Twitter users as
well. While this resulting social graph is not expected to be identical to
the original Foursquare graph, we will show that it conveys meaningful
information, providing results comparable to the other datasets. 

\subsection{Gowalla}
Gowalla is a location-based social networking service created in 2009.
Users can check in places through a dedicated mobile application; such
check-ins are then pushed via notifications to other friends on the service and,
by linking accounts, to Twitter and Facebook.  The friendship relationship is
mutual, requiring each user to accept friendship requests to allow location
sharing. Gowalla  was discontinued at the end of 2011 as the company was
acquired by Facebook.

We acquired a complete snapshot of Gowalla in August 2010.  Gowalla
provided a public API offering access to information about user profiles, friend
lists, user check-ins and place details.  For every user we have gathered the
user profile, the friends list and the list of all the places where the user has
checked in. The API did not provide unauthorised access to fine-grained
temporal information about user check-ins. However, the API presented the
timestamps of the earliest and latest check-ins for each place where a user had
checked in.
Since users were identified by consecutive numeric IDs, we were able to
exhaustively query all user accounts and download all the aforementioned
information.  

\section{The spatial structure of online social networks}
\label{sec:socio_spatial}
%Among the potential research questions that are triggered by the wealth of
%social and geographic data available on online services,
%We start our exploration of the spatial stru
Our exploration now focuses on understanding the spatial structure of the social
graphs in these online services.  Specifically, we want to focus at first on
global properties, studying the effect of geographic distance on online ties
and the likelihood of connection between individuals at different distances
from each other.
Then, given the heterogeneity shown by individuals in social networks,
exhibiting a wide variability in their number of connections, we expect their
spatial properties to be similarly heterogeneous. In other words,
space could homogeneously affect users or, instead, some individuals may
exhibit a preference for long-distance connections.

In this section we address these issues with a comparative study of data
collected from the three services described previously.  
%Each
%service offers data about social ties and user locations, enabling the study of
%socio-spatial properties.  
At first, we focus on the global spatial properties of the social graphs: we
study whether distance affects social ties by looking at the likelihood of
connection between users as a function of their geographic distance. We then
shift the focus of our analysis to individual users, showing that both their
connections and the triads they belong to are influenced by spatial proximity.
  
We assess the statistical significance of the observed spatial properties by
designing two randomised null network models; the comparison of the real graphs
with the null graphs sheds light on whether spatial and social factors are
influencing the structure of the network.  Overall, we observe robust and
universal features across the three services; this suggests that there might be social processes
that are not specific to one particular online tool adopted by users.


\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/three_degree_distributions_inset.pdf}
    \caption{Empirical Complementary Cumulative Distribution Function (CCDF) of the
        number of friends for each user in Brightkite, Foursquare and Gowalla. The inset shows
            the same distributions rescaled by dividing by the average number
            of friends in each network; the three datasets fall on the same
            curve. }
        \label{fig:three_degree_dist}
\end{figure}
\begin{table}[t]
\centering
\begin{tabular}{|c||c|c|c|c|c|c|c|c|}
\hline
Service & \emt{N} & \emt{K} & \emt{N_{GC}} & \emt{\langle k \rangle} & \emt{\langle C
  \rangle} &
\emt{H_{EFF}} & \emt{\langle D \rangle} & \emt{\langle l \rangle}  \\
\hline
\hline
Brightkite & 54,190 & 213,668 & 50,896 & 7.88 & 0.181 & 5.73  & 5,683 & 2,041 \\
\hline
Foursquare & 258,706 & 2,854,957 & 254,532 & 22.07 & 0.191 & 5.90 & 8,494 & 1,442 \\
\hline
Gowalla  & 122,030 & 577,014 & 116,910  & 9.28 & 0.254  & 5.43 & 5,663 & 1,792 \\
\hline
\end{tabular}
\caption{Properties of the traces: number of nodes \emt{N} and edges \emt{K} in the
    social network, number of nodes in the largest connected component
      \emt{N_{GC}},
average node degree \emt{\langle k \rangle}, average clustering
   coefficient \emt{\langle C \rangle}, 
   90th percentile of shortest path length distribution \emt{H_{EFF}}, 
   average geographic distance between nodes \emt{\langle
   D \rangle} [km], average link length \emt{\langle l
    \rangle} [km].} 
    \label{tab:spatial_dataset_properties}
\end{table}



\subsection{Global spatial properties}
\label{subsec:global_spatial}
We study the spatial properties of the social networks by representing them as
\textit{spatial graphs}, where nodes 
are positioned in a space equipped with a metric. In our
case, online users are located over the 2-dimensional surface of the Earth and we
adopt the great-circle distance as our metric; the distance \emt{D_{ij}} between any two
nodes \emt{i} and \emt{j} is then computed given their geographic coordinates.
Then, the social network can be represented as an undirected graph \emt{G} with \emt{N}
nodes and \emt{K} links, with users as nodes and where a link exists for each social
tie (i.e., a person lists another user as one of
his/her friends). 
We assign a length \emt{l_{ij}} to each social link so that
\emt{l_{ij} = D_{ij}}.
Each social link may be undirected or directed; in the latter case, the
existence of a link from node \emt{i} to node \emt{j} does not imply the
existence of the reverse link from \emt{j} to \emt{i}. Unless explicitly
specified, we always consider undirected connections in our graphs.




The general properties of these three datasets are reported in
Table~\ref{tab:spatial_dataset_properties}. The social networks are heterogeneous in
size, ranging from 54,190 nodes in Brightkite to 258,706 in Foursquare; the
average degree is lower in Brightkite and Gowalla, respectively 7.88 and 9.28,
than in Foursquare, where users have on average 22.07 friends.
In all the networks, the largest connected component, also known as the giant
component, contains the vast majority of the users.
The degree distributions for the three networks are reported in
Figure~\ref{fig:three_degree_dist}; they all show a heavy tail, with some
users having thousands of friends. Rescaling the degree
distributions by dividing by their average values results in a common trend,
as shown in the inset. These networks also exhibit high
values of the average clustering coefficient, between 0.18 and 0.26, and
short topological distances between their nodes, with 90\% of all pairs
being fewer than 6 hops away from each other. These properties confirm the small-world
nature of these networks, as found in many other online social
systems.
\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/three_spatial_distances_cdf.pdf}
    \caption{Empirical Cumulative Distribution Function (CDF) of the geographic distance
        between all users (dotted line) and between connected friends (solid line) for the
            three datasets. Logarithmic binning has been adopted to estimate the
    number of samples in each range of values.}
        \label{fig:three_edge_length_dist}
\end{figure}

The average geographic distance between users \emt{\langle D \rangle} is consistently
larger than the average distance between friends \emt{\langle l\rangle} across
all the datasets; while the first value ranges between 5,600 and 8,500 km, the
latter has much smaller values, between 1,400 and 2,000 km. This already provides
evidence that the probability of a social link between two users
decreases with distance; we will further investigate this
relationship later. The distribution of social link length is comparable across
the three datasets, as shown in Figure~\ref{fig:three_edge_length_dist}: about
40\%-50\% of all couples of friends are within 100 km, with more than
3\% of all links being shorter than 1 km. 
The 
distribution of distances between users, also depicted in
Figure~\ref{fig:three_edge_length_dist}, is different; about 50\% of
users are at distances larger than 4,000 km, across all the networks.


\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/three_friendprob_geo.pdf}
    \caption{Probability of friendship between two users as a function of their
    geographic distance, for the three datasets under analysis. Logarithmic binning has been adopted to
        estimate the probability in each range of values.
   %   The two straight
   % lines represent probability \emt{P(d) \sim d^{-\alpha}} with two different
   % exponents \emt{\alpha=0.5} and \emt{\alpha=1.0}.
    }
        \label{fig:three_friendprob}
\end{figure}
\subsection{The effect of distance on online friendship}
\label{subsec:friendprob}
To further investigate whether social links are more likely to exist between
geographically close rather than distant users, we estimate the probability of
friendship \emt{P(d)} as a function of distance \emt{d} by counting \emt{L_d}, the
number of social links with length \emt{d}, and by estimating \emt{N_d}, the number of
pairs of users at distance \emt{d}. This gives us \emt{P(d) = L(d) / N(d)}. As
discussed before, this relationship has been found in other studies to be close to a law
\emt{P(d) \sim d^{-\alpha}}, with values of \emt{\alpha} ranging between 1 and
2.%~\cite{Liben2005,Lambiotte_2008,Backstrom_2010,Goldenberg09}.  

As reported in
Figure~\ref{fig:three_friendprob}, our datasets present noisy patterns: we
notice an almost flat probability in the range 1-10 km, then  all curves decrease as distance
grows, reaching another steady probability between 1,000 and 4,000 km. This
final plateau  might denote a background probability that affects
ties spanning more than 1,000 km. Similar constant trends at short and long geographic
  ranges have also been found in other online systems~\cite{LNKR05:routing, BSM10:findme}.
The appearance of social ties longer than 4,000 km is constrained by
the fact that both Europe and North America, where a large proportion of users are
based, are not large enough to allow such long-range connections and the
distance between these two regions is about 6,000 km.

We find that our traces are closer to a decay \emt{d^{-\alpha}} with \emt{\alpha =
0.5}, whereas larger exponents have been found in other similar studies; it
seems that in the location-based services under analysis
long-range social ties have a relatively higher probability of occurrence than in other
social systems.
A potential explanation of this behaviour is that these platforms are relatively new, so
they have mainly attracted early adopters. These users tend to be tech-savvy,
with many existing long-distance online friendship ties which they
bring to these services.  This might not happen in
other types of social networks, such as those  extracted from mobile phone
communications or Facebook interactions, which are already mature.

Indeed, mobile phone connections exhibit a larger exponent \emt{\alpha} than
online social networks: phone conversations are much more constrained by
geographic distance than interactions on Facebook. This might be due to the
fact that mobile phones represent a mature technology, adopted by the vast
majority of the population. Also, mobile phone calls could often
be exchanged to arrange face-to-face meetings, which take place between
spatially close individuals, as discussed by Calabrese et al. in a
large-scale study of mobile phone call records~\cite{CSB11:mobiles}. 
As location-based services become more mainstream their user
audience may broaden and include individuals who are affected by distance in
a stronger way. 


\subsubsection{Decoupling social and spatial factors: network randomisation}
After these initial investigations, we assess the statistical significance
of the empirical spatial properties of these networks using two
\textit{randomised models}, which capture either the geographic or the social
properties of the original social networks and randomise everything else:

\begin{itemize}
\item \textbf{Geo}: this null model keeps the user locations unmodified and then
creates a social link between two users at distance \emt{d} according to the 
relative probability of friendship \emt{P(d)} (as reported in
        Figure~\ref{fig:three_friendprob}).
\item \textbf{Social}: this null model keeps the social connections  as they
are, shuffling at random all user locations.
\end{itemize}

The overall properties of these models are a direct consequence of their
definition. Both models result in a network with exactly the same number of
nodes and, on average, the same number of edges. The Social model has the 
social properties of the original network, including degree distribution,
clustering coefficients and topological network distances, but link geographic lengths are
distributed as the pairwise user spatial distances: as a result, the
average link length becomes higher than in the original network, with
\emt{\langle l_{SOCIAL} \rangle = \langle D \rangle}. On the other hand, the Geo
model has the same distribution of link geographic lengths as the original network,
so that \emt{\langle l_{GEO} \rangle = \langle l \rangle}, but the social
properties are now lost: the degree distribution is peaked and has no
heavy tail, while the average clustering coefficient is much lower, since
there are fewer social triads. However, the two network models present
similar distributions of topological distances, with about 90\% of all pairs of
nodes always within 6 hops. 

We will exploit these two null models by comparing their properties to those 
of the real networks, in order to understand whether the observed
socio-spatial characteristics  might be explained in terms of simple geographic
or social factors. Every analysis performed on a randomised null model
will be averaged over 100 different random realisations.


\subsection{User spatial properties}
We now focus on individual users, studying the extent to which their social ties stretch across
space. We define 
\begin{equation}
\label{eq:friend_distance}
w_i = \frac{1}{k_i}\sum_{j \in \Gamma_i} l_{ij}
\end{equation}
\noindent to be the \textit{friend distance} of user \emt{i}, where \emt{\Gamma_i} is the
set of neighbours of node \emt{i} and \emt{k_i = |\Gamma_i|} is its degree. The overall
distribution of \emt{w} is reported in Figure~\ref{fig:avg_friend_dist} for the
original social network and for the two randomised versions. The existence of
values over all geographic scales is due to the existence of users with different
characteristic lengths of interaction. For instance, about 10\% of users have
connections with an average length shorter than 10 km, whereas around 20\% of users
have values of friend distance above 2,000 km.
%^In particular, 
%   since this distribution closely matches the aggregated link length distribution in 
%Figure~\ref{fig:three_edge_length_dist}, 
Links with different geographic lengths do
not appear homogeneously across all users; instead, there is heterogeneity
between users, with some having only short-range connections and others with
long-distance ties. These correlations are stronger than one would expect by
chance; in fact, the two randomised models suggest that values of \emt{w} should be
more peaked around the average, not spread over a large range of
magnitudes.
\begin{figure}[t]
  \centering
    \subfigure[Brightkite]{
      \includegraphics[width=\tripleplotscale]{images/bkite_dist_strength_cdf.pdf}}
    \subfigure[Foursquare]{
      \includegraphics[width=\tripleplotscale]{images/fsquare_dist_strength_cdf.pdf}}
    \subfigure[Gowalla]{
      \includegraphics[width=\tripleplotscale]{images/gowalla_dist_strength_cdf.pdf}}
    \caption{Empirical Cumulative Distribution Function (CDF) of the friend
        distance \emt{w} for each user in the social network, together with the
            distributions obtained in the randomised models.}
        \label{fig:avg_friend_dist}
\end{figure}

\begin{figure}[t]
  \centering
    \subfigure[Brightkite]{
      \includegraphics[width=\tripleplotscale]{images/bkite_null_distance_degree.pdf}}
    \subfigure[Foursquare]{
      \includegraphics[width=\tripleplotscale]{images/fsquare_null_distance_degree.pdf}}
    \subfigure[Gowalla]{
      \includegraphics[width=\tripleplotscale]{images/gowalla_null_distance_degree.pdf}}
    \caption{Average distance strength \emt{s(k)} as a function of node degree \emt{k}
        for the original network and for its two randomised versions. Each trend
            is fitted by a law \emt{s(k) \sim k^\beta}.}
        \label{fig:distance_strength}
\end{figure}

Another interesting result is obtained by studying the correlation between the 
friend distance \emt{w_i} and the degree \emt{k_i}. We study the user \textit{distance
strength}~\cite{BBP04:strength}, defined as 
\begin{equation}
s_i = \sum_{j \in \Gamma_i} l_{ij} = k_i w_i
\end{equation}
\noindent and then we compute the average distance strength \emt{s(k)} for all users
with degree \emt{k}. In the absence of any correlation, this measure should be linearly
correlated with the degree \emt{s(k) \sim k \langle l \rangle}, while a relation of
the form \emt{s(k) = A k^\beta} with \emt{\beta \neq 1} or \emt{A \neq \langle l
  \rangle}
would imply correlation between the distance strength and the degree. In
particular, \emt{\beta > 1} signals that users with more friends tend to have longer
connections than users with fewer friends. This relationship is reported in
Figure~\ref{fig:distance_strength} for the three datasets under analysis: we
obtain values of \emt{\beta} in the range 1.10-1.18 across the different networks,
showing weak positive correlation.
Real data reveal  a pattern much closer to the Social
model, which has \emt{s(k) \sim k}, with \emt{\beta=1}, than to the Geo
model, which has much lower values of \emt{\beta} in the range
0.2-0.4, showing negative correlation between node degree and 
friend distance. The case of the Geo model suggests that if only spatial
factors were shaping social connections, then users would accumulate several
links only when these links are predominantly covering short geographic
distances.
In reality, as users add more and more friends, on average their link
length slightly increases. This contrasts with what is found in the null models,
providing evidence that users with more connections tend to have friends
further away. 

\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/triangle_prob.pdf}
    \caption{Probability that a social link belongs to a triangle as a function
      of its geographic length for the three datasets under analysis. The solid
        lines show the average probability that a link belongs to a triangle for
        each network.}
        \label{fig:triangle_prob}
\end{figure}

\begin{figure}[t]
  \centering
    \subfigure[Brightkite]{
      \includegraphics[width=\tripleplotscale]{images/bkite_null_tri_stats.pdf}}
    \subfigure[Foursquare]{
      \includegraphics[width=\tripleplotscale]{images/fsquare_null_tri_stats.pdf}}
    \subfigure[Gowalla]{
      \includegraphics[width=\tripleplotscale]{images/gowalla_null_tri_stats.pdf}}
    \caption{Empirical Cumulative Distribution Function (CDF) of average triangle link length
        \emt{\langle l_{\Delta} \rangle}  for the original network and for the two
            randomised models.}
        \label{fig:null_tri_stats}
\end{figure}
\subsection{Spatial properties of triads}
We now shift our attention to understanding the geographic properties of social triangles.
Social networks usually present many triads, resulting in high values of
clustering coefficient. Our networks exhibit similar patterns, with
clustering values between 0.18 and 0.26. We extract 377,438 triangles
in the Brightkite social networks, 18,764,129 in Foursquare and 1,327,559 in
Gowalla. Between 70\% and 86\% of all links in each social network belong to at
least one triangle, given the highly clustered structure of these social
networks.  We find that social triangles arise at a wide range of geographic
lengths. Investigating the probability that a link belongs to a
triangle as a function of its length provides a surprising result: this
probability is largely unaffected by distance, as shown in
Figure~\ref{fig:triangle_prob}. A link is equally likely to
belong to a social triangle regardless of its length. A related result was found
by Lambiotte et al.~\cite{LBD08:mobile}: many spatially
local clusters of people tend to appear in mobile phone communication networks,
      with social links below 40 km more likely to participate in social triads,
      but then this likelihood reaches a constant value for longer links. As we
      have already seen, online behaviour appears less sensitive to distance than
      mobile phone communication.  Overall, the trend that longer links equally
      participate in social triangles holds also in our datasets.

To assess user heterogeneity, for each social triangle we compute the
\textit{geographic triangle length}
\emt{l_{\Delta}}. That is, 
   we compute the arithmetic average of the spatial length 
of the three links that constitute the triangle.
Then, we compute
\emt{\langle l_{\Delta} \rangle_i} for each user \emt{i}, which averages
\emt{l_{\Delta}} over all 
the triangles he/she belongs to. This value does not take into
account how many triangles a user might belong to, as the clustering coefficient
does so
by normalising with respect to the number of potential triangles.
Instead, we aim to assess merely the geographic span of a user's social
triangles, however many there are.  In Figure~\ref{fig:null_tri_stats}
we show the distribution of \emt{\langle l_{\Delta} \rangle} over all users:
triangles with different geographic span do not arise equally among all
users, but instead there are users with smaller triads and users with wider
ones.  

\begin{figure}[t]
    \centering
    \subfigure[Brightkite]{
      \includegraphics[width=\tripleplotscale]{images/bkite_null_shape_degree.pdf}}
    \subfigure[Foursquare]{
      \includegraphics[width=\tripleplotscale]{images/fsquare_null_shape_degree.pdf}}
    \subfigure[Gowalla]{
      \includegraphics[width=\tripleplotscale]{images/gowalla_null_shape_degree.pdf}}
        \caption{Average geographic triangle link length \emt{\langle l_{\Delta} \rangle}
            for all users with degree \emt{k},  as a function of degree
              \emt{k}.  Results for the original network and for the two randomised models.}
        \label{fig:null_tri_degree}
\end{figure}

For example, there are at least 20\% of users with an average triangle
length below 100 km, while the top 20\% have values above 2,000 km. This
heterogeneity is much higher than one would expect if space did not matter, as
the Social model mainly exhibits values in the range 1,000-10,000 km. However,
if social mechanisms were not a factor at all, then social triads should
be smaller, as the Geo model shows. The existence of both local,
short-range triads and global, long-distance ones needs to be related to
the influence both of geographic distance and of social processes such as
homophily, triadic closure and focus
constraint~\cite{MSLC01:homophily, Gra73:ties, Fel81:focus}.

We further study this heterogeneity arising among users by computing the average
\emt{\langle l_{\Delta} \rangle} for users with a given degree, as a function of the degree. In these social networks
\emt{\langle l_{\Delta} \rangle} increases with the number of friends, as shown in
Figure~\ref{fig:null_tri_degree}.   This effect is not present in
the randomised networks: the Social model shows no correlation at all, while the Geo model
exhibits the opposite trend, with smaller triangles appearing among users with
higher degrees.  Apparently, there are both social
and geographic factors influencing social triangles, since having only
one type of factor does not reproduce the empirical data.

These results signal that users with fewer friends tend to generate social triangles on a
smaller geographic scale, while users with more friends belong to triangles
with longer links. This confirms  the strong connection between the number of
connections of a given user and the geographic distance of these friendship
connections.



%\begin{figure}[th]
%  \centering
%    \subfigure{
%        \label{fig:triangle_stats_a}
%      \includegraphics[width=.8\columnwidth]{images/tri_stats.pdf}}
%    \subfigure{
%        \label{fig:triangle_stats_b}
%      \includegraphics[width=.8\columnwidth]{images/user_tri_stats.pdf}}
%    \caption{
%            Cumulative Distribution of triangle link length (a) 
%        and empirical Cumulative Distribution of the average social triangle link length for each
%        user (b). Both  distributions closely follow the distribution of social link geographic length. }
%        \label{fig:triangle_stats}
%\end{figure}
%
%\begin{figure}[th]
%  \centering
%      \includegraphics[width=.8\columnwidth]{images/user_clust_degree.pdf}
%    \caption{Average clustering coefficient as a function of user degree.}
%        \label{fig:clust_degree}
%\end{figure}


%\begin{figure}[th]
%  \centering
%    \subfigure{
%        \label{fig:triangle_degree_a}
%      \includegraphics[width=.8\columnwidth]{images/user_shape_degree.pdf}}
%    \subfigure{
%        \label{fig:triangle_degree_b}
%      \includegraphics[width=.8\columnwidth]{images/user_shape_degree_norm.pdf}}
%    \caption{
%        Average social triangle link length for each user as a function of
%        user degree (a) and the same quantity normalised with
%            respect to the average link length of each social network (b). Only
%            users belonging to at least one social triangle are considered.}
%        \label{fig:triangle_degree}
%\end{figure}


%\begin{figure}[t]
%\centering
%      \includegraphics[width=.8\columnwidth]{images/bkite_gravity.pdf}
%    \caption{Average link length \emt{\langle l_{ij} \rangle} as a function of the
%        product of the node degrees \emt{k_i k_j}
%        for the original network and for its two randomised versions in Gowalla.
%    The other two networks exhibit similar patterns.}
%        \label{fig:gravity}
%\end{figure}

\section{Geosocial network measures}
\label{sec:metrics}
Online social networks are affected by geographic distance but,  as discussed
in the previous section, users exhibit large variations in the spatial
characteristics of their social ties. Since network measures have been
used extensively to differentiate the social properties of users,  our aim moves now towards defining measures that
combine network properties and spatial distance, allowing us to identify
also geosocial differences.  Measures that augment social
structure with geographic information would add a new dimension to social network
analysis and could enable a large number of practical applications related to
social systems. One example will be presented in Chapter~\ref{ch:caching}.

In this section we present two novel geosocial measures: the \textit{node
locality}, which quantifies how much an individual is connected to a local
rather than global set of friends, and the \textit{geographic clustering
coefficient}, which extends the standard notion of clustering coefficient by
taking into account the extent to which clusters of people are connected by short-range
ties. These measures offer a way to compare individual users inside
the same network but also to compare different spatial social networks,
regardless of the particular geographic space they span.

In order to explore the benefits offered by our measures, we apply them to four
different online social networks which provide location information for their
users. We compare social graphs with different spatial properties by
choosing two location-sharing social services, one blogging community and a social micro-blogging platform.
The range of functionalities offered by these services and the differences
arising with respect to their  semantics will enable us to use our measures
to understand the effect of spatial distance across different online platforms.

In particular, using our new measures we show that the four services present contrasting
characteristics, which may be explained by different attitudes of their users
towards the social and geographic aspects of online friendship. Location-based
platforms exhibit users with higher preference for short-range social connections than
sharing-based services, where social ties appear less constrained by spatial distance.

\subsection{Augmenting network measures with spatial distance}
\label{subsec:geomeasures}
%Again, nodes represent users and a link
%among two nodes exists if there is a social tie between them (e.g., a person
%    lists another user as one of his/her friends).  
%The main difference is that
%in this case a link may be undirected or directed: in the latter case, the
%existence of a link from node \emt{i} to node \emt{j} does not imply the
%existence of the reverse link from \emt{j} to \emt{i}.  
%Given a fixed location on the
%Earth for each user, nodes are embedded in a 2-dimensional metric space where
%the distance between two nodes \emt{i} and \emt{j} is given by the geographic distance
%\emt{D_{ij}} between their locations on the planet.  This distance is used as the
%length of the link \emt{l_{ij}} between nodes \emt{i} and \emt{j}. 

%AIM
%Highlight only extremely short-range social connections.
%Normalise this measure for nodes with various degrees.
%Allow networks at different geographic scales to be compared.
Our main motivation to introduce these measures is given by the heterogeneity we
have observed across users in Section~\ref{sec:socio_spatial}: in online social
services users with social ties, and social triangles, spanning only
short distances coexist with users having long-distance connections. Even though
we have observed correlations of spatial properties with the number of
connections users have, individual users could still deviate
from the average behaviour.


When defining our new geosocial network measures our chief objective is
to individuate  users who predominantly exhibit short-range connections: this
might be useful for a wide set of purposes, such as geographically caching data
related to online interactions~\cite{WPD10:locality} and distributing content
items by sending them where traffic requests are anticipated to
arise~\cite{THT12:tailgate}.
         Our aim is to create measures that take into account
the fact that users have different numbers of connections;  we achieve this by introducing normalisation by node degree. 
Our measures should capture spatial properties regardless of the
geographic scale of the social system. In other words,  each measure should be related
to the spatial size of the social graph, so that one can compare social
connections within a city to connections between individuals across a
large country.  We will highlight in our discussion how our measures meet 
these requirements. The definitions of our new network measures are based on the spatial
representation of a social graph already introduced in
Section~\ref{subsec:global_spatial}.  


\paragraph{Node locality}
The first measure quantifies the geographic closeness (i.e., the locality) of
the neighbours of a certain node to the node itself.  
Let us consider an
undirected geographic social network, a node \emt{i} with a particular geographic
position and the set \emt{\Gamma_i} of its neighbours. The node degree \emt{k_i} is the
number of these neighbours, that is \emt{k_i = |\Gamma_i|}. Then, the \textit{node
  locality} of \emt{i} can be defined as a measure of how geographically close
  its neighbours are and it is computed as follows:
\begin{equation}
\label{eq:locality}
NL_i = \frac{1}{k_i} \sum_{j \in \Gamma_i} e^{-l_{ij} / \beta}
\end{equation}
\noindent where \emt{\beta} is a scaling factor to avoid extremely
small values of node locality when links have large lengths. 
This definition fits the three main requirements that we discussed earlier.
By definition,
\emt{NL_i} is always normalised to be between 0 and 1, where the latter is achieved
only when all friends of a user are in the same location as the user herself.
The value of \emt{\beta} can be chosen so that
networks with different geographic size can still be compared to each
other, as we will discuss more in detail later. We adopt an exponential decay 
to highlight social ties that span over short geographic
distances and to penalise longer ties. Finally, normalisation by node degree is
needed to take into account the huge heterogeneity observed in the degree
distribution of such graphs.

In a similar way, in the case of directed graphs the \textit{node
in-locality} can be defined considering only the incoming connections of a
node
\begin{equation}
\label{eq:inlocality}
L_i^{IN} = \frac{1}{k_i^{IN}} \sum_{j \in \Gamma_i^{IN}} e^{-l_{ji} / \beta}
\end{equation}
\noindent where \emt{\Gamma_i^{IN}} is the set of its incoming neighbours and  
    \emt{k_i^{IN} =|\Gamma_i^{IN}|} is the in-degree of the node.  The \textit{node
        out-locality} is defined in a similar manner considering only outgoing
        links:
\begin{equation}
\label{eq:outlocality}
L_i^{OUT} = \frac{1}{k_i^{OUT}} \sum_{j \in \Gamma_i^{OUT}} e^{-l_{ij} / \beta}
\end{equation}

\paragraph{Geographic clustering coefficient}
While node locality captures how close the neighbours of a node are, another
measure is needed to quantify how connected the neighbourhood of a node
is over space. The \textit{geographic clustering coefficient} can be defined as an
extension of the clustering coefficient used for complex
networks. The clustering coefficient measures the
fraction of existing triangles among the neighbours of a given node, compared to
the number of possible triangles. This geographic
adaptation gives more weight to triangles when they are formed by nodes that are
close to each other than when nodes are at a greater distance.  The
geographic clustering coefficient of node \emt{i} is thus defined in the same way as
the clustering coefficient, but each existing triangle between nodes \emt{i}, \emt{j}
and \emt{k} is assigned a weight \emt{w_{ijk}}  defined as:
\begin{equation}
\label{eq:geoclust}
  w_{ijk} = e^{-\frac{\Delta_{ijk}}{\beta}}
\end{equation}
\noindent where \emt{\Delta_{ijk}} is the maximum length among the
three links, that is \emt{\Delta_{ijk} = max(l_{ij}, l_{ik}, l_{jk})}. We define
\emt{w_{ijk} = 0} if there is no link between \emt{j} and \emt{k}.  

Since this measure uses the maximum weight among all the links of a triangle, it
focusses on nodes that are all close to each other: when just one of the three
nodes is not close to the other two, the weight will immediately decrease. This
emphasises social triangles where the three users are close to each other.
Again, the parameter \emt{\beta} is used to scale the values of the measure with
respect to the geographic span of the entire network.

In the case of directed graphs, as for the standard clustering
coefficient, we consider triangles containing undirected links joining node \emt{i}
to its neighbours and directed links for the remaining side.  If we
consider \emt{\Gamma_i} as the set of all the neighbours of node \emt{i} (considering
    both incoming and outgoing links), with \emt{k_i = | \Gamma_i |}, the geographic
clustering coefficient is defined as:
\begin{equation} 
GC_i = \frac{1}{k_i(k_i-1)} {\sum_{j,k \in \Gamma_i} w_{ijk}}
\end{equation}
\noindent where the sum is extended only to existing triangles. Since there are
exactly \emt{k_i(k_i-1)} different ordered couples of neighbours in
\emt{\Gamma_i}, \emt{GC_i} is normalised to be between 0 and 1 by definition.
This definition also fits the same requirements met by node locality.


\paragraph{Choice of scaling factor}% \emt{\beta}}
When using these geosocial measures on different spatial networks, one should
be able to compare results across them regardless of the specific geographic
span. These measures should scale with the geographic size
of a system, so that when the system is enlarged,  or made smaller,  the
measures  do not change.  For instance, one could compare a spatial social graph
arising between individuals in an urban area, with a maximum span of about
50 km, and a graph formed from connections across a large country, with distances
up to 1,000 km.  In other words, these measures should capture whether
nodes preferentially exhibit short-range rather than  long-distance ties with
respect to the expected spatial dimension of the system.

This is accomplished by choosing an appropriate value for the scaling
factor \emt{\beta} used in Equations~\ref{eq:locality} and~\ref{eq:geoclust}. 
Using the same value for every network, the graph whose nodes are at shorter
distances from each other might have higher values of geosocial measures than
the other. Instead, we want to be able to compare the spatial structure of
two networks even if it arises at different geographic scales, e.g., city-wide
or nation-wide.  A reasonable choice for each network is to adopt a scaling factor
\emt{\beta} equal to the average spatial distance between all its nodes.  This choice is
dependent only on the positions of the nodes of the social graph, not on their
links.  It is worth noting that when considering a single social network the
scaling factor becomes less important and could be ignored by setting
\emt{\beta=1}, for instance.
%in this way we can understand whether the same set of nodes has 
%different geosocial properties when the links between them change.


\subsection{Case-study: assessing the impact of location-sharing}
We explore the effectiveness of our measures by applying them to a set of four
different social services with location information about their users. 

\subsubsection{Traces}
We use traces from four different social services, created with different goals
and offering different features to their users. Two of them, Brightkite and
Foursquare, are location-sharing services: the other two are LiveJournal,
a blogging network, and Twitter.  They all provide static geographic
information about their users, in explicit or implicit form (e.g., geographic
    coordinates or a city name). The traces used for Brightkite and Foursquare
are the same as those described in Section~\ref{sec:methodology}, while for LiveJournal
and Twitter we describe our collection methodology here.

\paragraph{LiveJournal} 
LiveJournal is a community of bloggers with over 10 million active users as of the
end of 2010.  Users can keep a blog or a journal and establish friendship
connections to other users. Each user provides a personal profile, which often
includes home location, personal interests and a list of other bloggers
considered as friends.  Friendship links are not always reciprocal. 
%LiveJournal users often engage
%in social interactions: people can comment on each other's journal
%entries and can also have replies sent directly via email.
The data collection process involved both crawling the social network links
through the API and downloading the user profile pages.
%Seed users were acquired by accessing the public timeline over 24
%hours and then 1,000 users were randomly selected among all the users
%retrieved. 
The duration of the collection was 9 days, from November 2 to
November 9, 2009, obtaining a sample of 1,502,684 users. Given the
1,226,412 users who provided location information, we successfully
obtained a meaningful geographic location for 992,886 users.

\paragraph{Twitter}
Our crawling process was seeded collecting 1,000 seed users from the public
timeline, which shows a list of the 20 most recent tweets posted by users with
unrestricted privacy settings to the entire service.  The duration of the data
crawling was 6 days from December 3 to December 8, 2009, gathering information
about profiles and follower lists for 814,902 different users. Of these,
535,653 reported some information about their home location.  We have
successfully geocoded 409,093 users, translating their location
information into a point on the Earth.
    
To extract and collect information from
LiveJournal and Twitter we crawled a sample of users employing
snowball sampling: the data extraction starts from a set of seed
users and expands the extraction by following the outgoing links of
these users to reach new users, and so on.  
%However, the snowball sampling
%procedure is known to be biased~\cite{Lee2005}, since there is a
%higher probability of sampling nodes with more links. 
%This aspect will be taken into consideration during the evaluation of the
%analytical results.


\begin{table}[t]
\centering
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
Dataset & \emt{N} & \emt{K} &  \emt{\langle k \rangle} & \emt{\langle C \rangle} &
\emt{\langle D_{ij} \rangle} & \emt{\langle l_{ij} \rangle} & \emt{\langle NL
  \rangle} & \emt{\langle GC \rangle} \\
\hline
\hline
Brightkite & 54,190 & 213,668  & 7.88 & 0.181 & 5,683 & 2,041 & 0.82 & 0.165 \\
\hline
Foursquare & 258,706 & 2,854,957 & 22.07 & 0.191 & 8,494 & 1,442 & 0.90 & 0.173  \\
\hline
LiveJournal & 992,886  & 29,645,952 &  29.85 &  0.185 &  6,142 & 2,727 & 0.73/0.71 & 0.146 \\
\hline
Twitter & 409,093 & 182,986,353  & 447.29  &  0.207  &  6,087 &  5,117  & 0.57/0.49  & 0.108 \\
\hline
\end{tabular} 
\caption{
  Properties of the datasets: number of nodes \emt{N} and edges \emt{K}, average
  node degree \emt{\langle k \rangle}, average clustering coefficient
    \emt{\langle C \rangle}, average distance between nodes \emt{\langle D_{ij} \rangle} 
 [km], average link length \emt{\langle l_{ij} \rangle} [km],
    average node locality \emt{\langle NL \rangle} (in/out), average geographic
      clustering coefficient \emt{\langle GC \rangle}.}
  \label{tab:four_dataset_properties}
\end{table}

\subsubsection{General properties}
In Table~\ref{tab:four_dataset_properties} we compare some basic properties of
the traces under analysis. The graphs extracted from these services are
different in size: Brightkite and Foursquare have average node degrees
\emt{\langle k \rangle} of 7.8 and 22.0 respectively, LiveJournal has an
average degree of about 30 and Twitter shows a larger value of 447. 
The case of Twitter is peculiar: this social network encourages users
to follow a large number of other users and, since no reciprocation
in link creation is needed, it is easier for a user to accumulate a
large number of social connections. Also, Twitter users may accumulate many
connections as sources of updates, as the service is used as a news feed
by many~\cite{KLP10:twitter}. Our samples show a giant
component containing almost all the nodes in each sample.

Differences between the social graphs  are present also with respect to the average
clustering coefficient \emt{\langle C \rangle}: Twitter and Foursquare have  higher
coefficients of 0.207 and 0.191 respectively, while LiveJournal scores 0.185 and
Brightkite 0.181.   In addition, since LiveJournal and Twitter are represented
as directed graphs,  we report their value of reciprocity
\emt{\rho}~\cite{GL04:reciprocity}, which measures how likely each link is to be
present in both directions and spans from \emt{\rho=1} for perfect reciprocity
  to \emt{\rho=-1} if each link is present only in one direction. We have
    \emt{\rho = 0.69} for LiveJournal, while Twitter has \emt{\rho=0.79}. Hence,
    both networks exhibit high values of reciprocity, although Twitter appears
      more symmetric; this property might be related to the fact that it
      encourages more reciprocal interactions than LiveJournal.



\subsubsection{Geographic properties}
After investigating the social structure of the services, we now analyse their
geographic properties.  One of the most important characteristics is the
geographic distance that social connections span; even if a link between two
users denotes some sort of social relationship, it is also important to take
into account how it stretches across space.  First of all, the networks under
analysis present different values of the average distance \emt{\langle
  D_{ij}\rangle} between users: Foursquare users exhibit an average distance of
  about 8,500 km, while in Brightkite this value goes over 5,600 km and in
  LiveJournal and Twitter it is above 6,000 km. The higher value in Foursquare
  reflects the wider global audience of that service, popular in the USA, in
  Europe and in Asia.
 % Thus, regardless of social
 % connections, Foursquare has a more geographically wide user base, while the
 % other datasets are more widely spread.
\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/four_spatial_distances.pdf}
    \caption{Empirical Cumulative Distribution Function (CDF) of the geographic link
      length for the four datasets. Logarithmic binning has been adopted to
        estimate the number of samples in each range of values.}
        \label{fig:four_edge_length_dist}
\end{figure}

In Figure~\ref{fig:four_edge_length_dist} we compare the cumulative probability
distribution of edge length for the four different social networks. The
distributions for Foursquare and Brightkite have already been presented
in Figure~\ref{fig:three_edge_length_dist} and they are reported here again to
aid comparison with the other two services.
Foursquare has the smallest average, only 1,442 km, yet only about
4\% of its links are shorter than 1 km.  Similarly,  only about 4\% of the links
in Brightkite are 
extremely short, with a global average length of 2,041 km. However,
between 60\% and 70\% of links are shorter than 1,000 km in these two
networks.  An opposite trend appears in LiveJournal,  with around 30\% of
links being shorter than 1 km, but an average link length of 2,727 km.  Finally,
Twitter links have an average length of 5,117 km: 
below 5\% of these links are shorter than 100 km,
while more than 80\% are longer than 1,000 km. This is a clear indication that Twitter users are
likely to be engaged with a global audience of followers, even though there
are also short-range social connections. 


\subsection{The effectiveness of geosocial measures}
After discussing the social and geographic properties of these services, we
apply our two novel measures, \textit{node locality} and \textit{geographic
  clustering coefficient}, which blend social and spatial factors together.
    
\begin{figure}[t]
  \centering
      \subfigure[Brightkite]
      {\includegraphics[width=\doubleplotscale]{images/bkite_locality.pdf}}
      \subfigure[Foursquare]
      {\includegraphics[width=\doubleplotscale]{images/fsquare_locality.pdf}}
      \subfigure[LiveJournal]
      {\includegraphics[width=\doubleplotscale]{images/livejournal_locality.pdf}}
      \subfigure[Twitter]
      {\includegraphics[width=\doubleplotscale]{images/twitter_locality.pdf}}
    \caption{Empirical Cumulative Distribution Function (CDF) of node locality
      for each dataset.}
        \label{fig:locality_dist}
\end{figure}

\subsubsection{Node locality}
The probability distributions of node locality for the four datasets are shown in
Figure~\ref{fig:locality_dist}. The main observation is that in Brightkite,
Foursquare and LiveJournal there is a non-negligible fraction of
users with node locality close to 1. Hence, 
there are some users who have social connections only with other individuals
within a close geographic distance. In the Brightkite network
about 40\% of users have a node locality higher than 0.90, and an even higher 
proportion is seen in the
Foursquare dataset. Users of these location-based services exhibit 
high average node locality: Brightkite has an average value of 0.82, while in Foursquare
this value is 0.90.

In the LiveJournal network only 20\% of users have a node
locality higher than 0.90 and the mean values are 0.73 for in-locality and 0.71 for
out-locality.  The node locality distribution appears similar both for in- and
out-locality.  In Twitter the distribution of node locality shows fewer
nodes with high values. This may provide evidence that Twitter users are more
likely to engage with a geographically spread set of individuals rather than
only with users at closer distances. Moreover, in- and out-locality exhibit
different patterns, since there are more than 15\% of nodes with an out-locality
of 0, probably nodes without outgoing connections. The average values are lower
than in the other networks: 0.57 for in-locality and 0.49 for out-locality.
\begin{figure}[t]
  \centering
      \subfigure[Brightkite]
      {\includegraphics[width=\doubleplotscale]{images/bkite_locality_degree_correlation.pdf}}
      \subfigure[Foursquare]
      {\includegraphics[width=\doubleplotscale]{images/fsquare_locality_degree_correlation.pdf}}
      \subfigure[LiveJournal] 
      {\includegraphics[width=\doubleplotscale]{images/livejournal_locality_degree_correlation.pdf}}
      \subfigure[Twitter]
      {\includegraphics[width=\doubleplotscale]{images/twitter_locality_degree_correlation.pdf}}
    \caption{Average node locality as a function of node degree for each dataset. For directed
        networks the relationship is shown both for
            incoming and outgoing links.}
        \label{fig:locality_correlation}
\end{figure}

These results show that location-based services
such as Brightkite and Foursquare  are characterised by short-range
friendship links between users, resulting in a vast proportion of them having
high values of node locality. Thus focussing merely on user location,
rather than on what users share and post, may give more opportunities to
discover potential friends who live nearby. In contrast, these
patterns are not present in social networks that are less centred on
user location; in LiveJournal users have connections with heterogeneous
length and this effect is even greater in Twitter.  Their users may be
more interested in becoming friends with individuals who post and share
interesting content rather than simply with people at close distance.

\subsubsection{Node locality and node degree}
We now analyse the correlation between node degree and node locality to understand
the geosocial properties of users with different numbers of connections.
The average node locality as a function of node degree for
Brightkite and LiveJournal is shown in Figure~\ref{fig:locality_correlation}: node locality
is slowly decreasing with node degree and only users with many
connections have lower values of node locality. Since LiveJournal and Twitter
are modelled as directed graphs we investigate the correlation in both directions. 
In LiveJournal the decreasing trend is evident for both in- and out-locality.
 Twitter users reach a maximum value of out-locality as their number of
outgoing links grows larger than 100, whereas in-locality shows a maximum just
before 100 incoming connections. Both relationships then decrease until they reach a plateau.


While it is expected that nodes with larger degrees exhibit smaller locality
values, since it is statistically more likely that they are connected to distant
users, this behaviour is not observed in Twitter: users with about 10 outgoing
connections have lower values of out-locality, but as the out-degree grows there
is a maximum at 100.  One possible explanation is that users with a small number
of links are probably mainly connected to popular accounts, e.g., institutions,
media and commercial entities, which are usually not geographically close to
them. 

Note that during the data collection, when users joined Twitter they were
presented with a list of 20 popular users to follow; these celebrities are
unlikely to be located close to the joining user. As a consequence, people who
just added those suggested connections when they joined the service may have
ended up with
a small number of connections which are not close from a geographic point of
view. 
    
%These values might denote users which are actively
%using Twitter mainly as a social networking service: these users presents a
%fairly small amount of connections and may fall within
%the 'acquaintances' category, denoting individuals which tend to reciprocate
%social connections and which are actively pursuing social interaction rather
%than just broadcasting their latest news (as the 'broadcasters' group) or trying
%to advertise their content (as the 'spammers' class).
%
%In the case of Twitter, where reciprocity is highly heterogeneous, it is worth
%analysing how the node locality depends on the in-degree/out-degree ratio. As
%shown in Figure~\ref{fig:twitter_ratio_corr}, node locality has generally a
%local minimum as the ratio approaches 1: this may it is clear how the highest
%values of in-locality appear for users with a ratio between 1 and 10: these
%users are followed by more users than the number of users they follow, but the
%proportion is not completely unbalanced as for global celebrities, which may
%have a ratio above 1,000 or 10,000. Thus, these users might be \textit{local
%    celebrities}, since they are followed by an audience which is mainly
%    localised. On the contrary, then the ratio increases the node in-locality
%    goes down, denoting a more global audience.  a few dozens of followers just
%    because more followers (incoming links) than ????


\begin{figure}[t]
  \centering
      \includegraphics[width=\plotscale]{images/four_geoclustering.pdf}
    \caption{Empirical Cumulative Distribution Function (CDF) of geographic clustering
        coefficient for each dataset. Logarithmic binning has been adopted to estimate the
    number of samples in each range of values.}
        \label{fig:geoclust_dist}
\end{figure}

\subsubsection{Geographic clustering coefficient}
The other geosocial measure that we have studied is the geographic clustering
coefficient. Since social networks are widely known to be characterised by the
presence of triangles, the aim of this measure is to understand whether triplets
of mutually connected users are more likely to be geographically close or,
instead, distant from each other. A user with high geographic
clustering coefficient has neighbours who are tightly interconnected and
close to the user themselves and to each other. The four datasets exhibit
different values of geographic clustering coefficient; while Brightkite has
an average value of 0.165 and Foursquare of 0.173, the average for LiveJournal
is 0.146 and Twitter scores 0.108. Also, the first two datasets exhibit a
geographic clustering coefficient close to their standard clustering
coefficient, while LiveJournal and Twitter present lower values when 
geographic distance is taken into account. Thus, in the former two networks clusters
form at shorter distances than in the latter. 

The probability distributions of the geographic clustering coefficient are shown
in Figure~\ref{fig:geoclust_dist}. The higher mean values for Brightkite and
Foursquare are explained by the fact that a non-negligible portion of users have
a coefficient of 1.0: about 10\% in Brightkite and about 5\% in Foursquare. On
the other hand, in Twitter higher values are less likely to be observed and
there is no discernible proportion of users with a coefficient of 1. These
results show that location-based platforms tend to have more geographically
confined triangles than social networks more focussed on content production and
sharing such as Twitter.

\begin{figure}[t]
  \centering
      \subfigure[Brightkite]
      {\includegraphics[width=.48\columnwidth]{images/bkite_geoclustering_degree_correlation.pdf}}
      \subfigure[Foursquare]
      {\includegraphics[width=.48\columnwidth]{images/fsquare_geoclustering_degree_correlation.pdf}}
      \subfigure[LiveJournal]
      {\includegraphics[width=.48\columnwidth]{images/livejournal_geoclustering_degree_correlation.pdf}}
      \subfigure[Twitter]
      {\includegraphics[width=.48\columnwidth]{images/twitter_geoclustering_degree_correlation.pdf}}
    \caption{Average geographic clustering coefficient as a function of node
        degree. For directed networks the relationship is shown both for in- and
            out-degree. }
        \label{fig:geoclust_correlation}
\end{figure}

\subsubsection{Geographic clustering coefficient and node degree}
We now investigate the relationship between geographic clustering coefficient
and node degree. As reported in Figure~\ref{fig:geoclust_correlation}, in Brightkite,
Foursquare and LiveJournal the geographic clustering coefficient steadily
decreases as the number of neighbours grows: thus, if a user has only few
friends they are more likely to create connections with people nearby. On the
other hand, Twitter shows a different behaviour: the geographic clustering
coefficient is slowly decreasing as the degree increases, but then it grows
again until reaching a local maximum around the value of 1,000, while it
decreases again for larger degrees. 

This particular property of the Twitter network may be explained by
the existence of users which are popular only in a particular region: they have
both incoming and outgoing links with a large audience which has, however,
     several interconnections on a confined scale. Indeed, a user which is
     locally popular might have lower values of node locality because of his/her
     large audience, as shown in Figure~\ref{fig:locality_correlation}, but users
     which are following him/her are also likely to share the same
     interests (since they follow the same popular user) and to become connected with each
     other.  Instead, when a user reaches a wider popularity, his/her
     followers will be both more geographically spread and less
     interconnected. 
    % Further investigation on this point may unravel interesting
    % findings. 


\section{Discussion and implications}
\label{sec:disc3}
In this chapter we have seen that spatial distance affects the social structure
of online services and that users exhibit different geographic and social
characteristics, with weak positive correlation between the number of friends and
their average distance. Also, a similar heterogeneity appears with respect to
social triads, with users participating in geographically wider triangles as
their degree increases. Our findings appear robust across the three traces under
analysis, as they arise regardless of the particular service we consider, the
data collection methodology, the time elapsed since the creation of the service
or the number of users in the social graph.  However, the  properties we observe
in the real systems do not appear in the two randomised versions of these
networks; therefore, \textit{their socio-spatial structure cannot be explained
by taking into account only geographic factors or social mechanisms.}


Indeed, this claim can be further supported by  considering the average length
of a link \emt{l_{ij}} as a function of the \textit{product of the degrees}
\emt{k_i k_j}.  As observed in Figure~\ref{fig:gravity}, longer links tend to
arise between users with more friends, while links connecting users with fewer
friends tend to be much shorter. This effect signals significant correlation
between users' social properties and their spatial behaviour. In fact, it is not
seen at all in the Social model; this suggests that there might be an
underlying spatial process taking place that results in this correlation, since
social ties are not equally likely to appear regardless of their geographic
length.  On the other hand, the Geo model exhibits the opposite trend, with
shorter links appearing mainly between well-connected users.  Hence, distance is
not the only factor affecting the link formation process; in other words, when
only mechanisms that depend on geographic distance are in place, a user
accumulates many friends only where there are many potential friends living
nearby, i.e., if he/she is located in an area with high density of users.
Furthermore, this geographic model cannot reproduce how some users accumulate
thousands of friends, creating a heavy-tailed degree distribution.
\begin{figure}[t]
  \centering
    \subfigure[Brightkite]{
      \includegraphics[width=\tripleplotscale]{images/bkite_gravity.pdf}}
    \subfigure[Foursquare]{
      \includegraphics[width=\tripleplotscale]{images/fsquare_gravity.pdf}}
    \subfigure[Gowalla]{
      \includegraphics[width=\tripleplotscale]{images/gowalla_gravity.pdf}}
    \caption{Average link length \emt{\langle l_{ij} \rangle} as a function of the
        product of the node degrees \emt{k_i k_j}
        for the original network and for its two randomised versions.}
        \label{fig:gravity}
\end{figure}

Since both social factors and geographic space need to be considered when
studying these systems, we have proposed two new geosocial measures, node
locality and the geographic clustering coefficient, that take spatial distance
into account.  Using these novel network measures we have been able to study
four different online services from a social and  a spatial perspective at the same
time. Our measures have highlighted that purely location-based social
networking services, which mainly focus on the geographic dimension of social
interaction, tend to have users with stronger preference for spatially
short social ties. In contrast, services based more on the idea of sharing information and
content have users with lower node locality and geographic clustering
coefficient values, showing that space has a weaker effect on these services.

All these findings support the claim that accurate modelling of spatial
social networks requires the incorporation of processes combining social and
spatial factors.  We have discussed how the effect of node degree on network
evolution is captured well by the preferential attachment model, where the
probability of connection between nodes \emt{i} and \emt{j}, \emt{P_{ij}}, is
proportional to the degree of node \emt{j}, \emt{P_{ij} \propto k_i}.  The
effect of geographic distance can be included in this attachment probability,
\emt{P_{ij} \propto k_i k_j f(D_{ij})}, where \emt{f} is a decreasing
\textit{deterrence function} of the geographic distance \emt{D_{ij}}
between the nodes. Thus, long distances tend to be covered only to
connect to important hubs, while nodes with fewer connections become
attractive when they can be reached over a short distance.  When the
deterrence function has a simple functional form such as \emt{f(d) \sim
d^{-\alpha}}, then the probability of a connection between two nodes
becomes similar to the gravitational attraction between celestial
bodies, \emt{P_{ij} \propto \frac{k_i k_j}{D_{ij}^{\alpha}}}. Hence,
these attachment models are known as \textit{gravity
models}~\cite{Car56:gravity, ES90:gravity}.

%attachment model, but geographic distance is an important parameter as well. In
%fact, long-range connections tend to exist mainly between well-connected
%hubs~\cite{ES90:gravity, JWS08:korean, KCR09:gravity}.
%In this formulation the intensity of
%interaction between two spatial nodes \emt{i} and \emt{j} is proportional to 
%\begin{equation}
%\frac{N_i N_j}{ f(d_{ij})}, 
%\end{equation}
%\noindent where \emt{N_i} is the importance of node \emt{i}, \emt{d_{ij}} is their
%distance and \emt{f(d)} is a deterrence function  which captures spatial
%effects. 
A gravity model balances the effect of
spatial distance with other node properties; the underlying assumption is that
longer (and more expensive) ties will appear mainly between important entities,
while a node will connect to an unimportant one only if they are close to
each other. These models have long been used to model connections in spatial networks such
as trade flows across countries~\cite{BMS08:imf}, traffic flows in highway
networks~\cite{JWS08:korean} and mobile phone calls between
cities~\cite{KCR09:gravity}.

Gravity models are only a first tentative step to reproduce the patterns we have
observed. In particular, gravity models only focus on pairs of nodes, without
taking into account social effects such as triadic closure and focus
constraint~\cite{Gra73:ties, Fel81:focus}.  Furthermore, even though the degree
of a node represents a reasonable choice as a ``mass'' variable for a social
gravity model, any notion of ``importance'' in a social network might
avoid quantification, thus making the definition of a sound social gravity
model hard to specify. Such individual importance may be an exogenous variable
which affects the socio-spatial structure, such as being a well-known celebrity
or any other type of individual popularity or social influence measure. It is
likely that any social gravity model will need to take into account this type
of heterogeneity across individuals.  We will explore these issues in
more depth in Chapter~\ref{ch:model}.
%Despite these difficulties, 
    
%As people spend more time online, more and more data
%will be available regarding their spatial behaviour and their social connections,
% allowing more reliable and precise models to be built. 
%Such models present many potential applications in the design of any type of location-based
%service, but also important implications for other systems such as security
%mechanisms, user identification techniques and recommendation engines. 



\section{Related work}
\label{sec:related3}
In this chapter we have presented our findings concerning the spatial properties of
online services, studying the spatial
structure of the social graph arising among individual users. Hence, the main
thread of research related to these results involves studies of the spatial
properties of complex networks and social systems.

%\subsection{The influence of space on complex and ties}
The effect of geography on complex networks has been studied mainly in
systems such as transportation networks and infrastructure networks, that is,
structures able to convey energy, matter or information at different scales and
in different scenarios. Some examples include Internet router
connections~\cite{YJB02:internet, BGG03:internet}, airline flights between
airports~\cite{GN06:spatial}, subway networks~\cite{LM02:boston}, electrical
power grids~\cite{SRC08:grid}, urban road networks~\cite{CLP06:urban,
CSLP06:cities}, maritime cargo shipments~\cite{HZ09:cargo}  and other systems
where nodes are embedded in a metric space. A more complete review discussing
many of these examples in depth has been compiled by
Barth\'{e}lemy~\cite{Bar11:spatial}. 

%
%%communication and transportation networks~\cite{GN06:spatial}. For instance, it
%%has been found that the spatial properties of the Internet topology are mainly
%%determined by both preferential attachment and linear distance
%%dependence~\cite{YJB02:internet}, whereas Internet traffic is spatially bound to
%%a spanning network which connects the most important centers around the
%%globe~\cite{BGG03:internet}. 
%
%More in general, the spatial structure of networks has been extensively studied,
%particularly when dealing with transportation and mobility networks,
%

%In such examples metric distance directly influences the network structure by
%imposing higher costs on the connections between distant entities. The overall
%effect is that longer links must be compensated by some advantage, thus drawing
%connections to nodes that are already well-connected. On the other hand, but
%with equal importance, short links tend to rise even to nodes that might not
%provide a notable advantage, because such connections are comparably cheaper to
%establish. This trade-off may has a considerable impact on the overall structure of
%the network and, consequently, on the processes that take place on it.


This abundance of work on spatial networks does not extend to social systems:
mainly because geographic data about individuals have been difficult to obtain,
especially for large systems, social network analysis has often neglected the
spatial perspective.  Thanks to some dedicated and relatively small-scale data
collection efforts, some sociologists have studied the effect of distance on
social ties. For instance, through a longitudinal study spanning different
decades, Mok and Wellman discuss how spatial proximity influenced social
interactions among resident of a neighbourhood in Toronto before the advent of
the Web~\cite{MW07:distance}, but also when  online communication tools became
widely available~\cite{MW09:distance}. Their results suggest that the effect of
distance remained strong over 20 years, even though communication was made
easier and more effective over the entire range of geographic distances.

%Yet, this effect may not be as important in other types of social interaction,
%particularly when focusing on connections mainly developed through new
%communication channels.  
   Some studies have explored a fundamental
spatial property of social networks: the probability \emt{P(d)} of having a social
connection between two individuals as a function of their distance \emt{d}. Even
though there seems to be agreement that \emt{P(d)} decreases with
distance, the exact relationship between these two variables is still unclear.  
Lambiotte et al.\cite{LBD08:mobile} found that it decays as \emt{P(d) \sim
d^{-2}} in a mobile phone communication network, while Liben-Nowell et
al.~\cite{LNKR05:routing} found a different relationship \emt{P(d) \sim
d^{-1} + \epsilon} among online bloggers on LiveJournal in the USA,
\emt{\epsilon} being a constant probability which acts on online communities
regardless of distance.  In another study, Backstrom et al.~\cite{BSM10:findme}
have similarly found spatial scaling \emt{P(d) \sim 1/d} of Facebook
connections; they show that this association appears so strong and important
that it can be safely exploited to infer where users are located only from the
location of their friends. 

It has also been proposed that the spatial structure of social networks might
be scale-invariant, with a universal distribution \emt{P(d) \propto
d^{-1}}~\cite{HWL09:universal}. Butts suggests that this relationship between
physical distance and connection probability in social networks is so important
that it can be used to explain entropy and predictability of
the social structure, provided that an upper bound can be defined for the 
likelihood that distant individuals are connected~\cite{But03:predictability}.
Our study goes beyond the investigation of this basic relationship between ties
and distance, analysing user heterogeneity and addressing the interplay between
spatial and social factors.

Finally, a few other studies have investigated the structural properties of a
location-based social network and how social and geographic distance influences
the creation of new connections between its users~\cite{LC09:lbs, LC09:model}.
Such investigations fail to address the heterogeneity we have observed across
users and do not discuss whether the global structure of the network is 
influenced by  social and  spatial factors. In doing this, our findings represent
some important initial steps towards a better and more comprehensive
understanding of the spatial properties of online social networks.

\section{Summary}
\label{sec:concl3}
Location-sharing features offered by online social services allow users to
engage in a new way with the physical world around them and with their friends.
At the same time, as individuals create and share location-tagged information,
they reveal their geographic position and their spatial movements. In order
to understand the effect of space and distance on online social connections we need
to measure and interpret traces extracted from such services.

In this chapter we have addressed these issues, offering a series of results. We
have presented a large-scale study  based on traces collected from mobile
location-based social services. We have used check-in data to assign  geographic
positions to users and to study their social graph as a spatial network. Through a
comparative study of three different services, and by adopting
randomised null models to disentangle social from spatial factors, we have
discovered that the spatial properties of users are heterogeneous.
These findings will be revisited in Chapter~\ref{ch:model} when we
discuss the temporal evolution of a spatial social network.

Finally, we have devised two new network measures that combine
social and spatial characteristics: node locality and the geographic
clustering coefficient. We have explored their potential by using them to
assess the effect of spatial distance on different online social services.
We will also see in Chapter~\ref{ch:caching} that these measures can be
successfully used when designing content delivery distribution systems.
